Today we’re going to talk about reproducibility, why it’s such a widely discussed topic, and ways we can guard against irreproducible results.
We start the story in 1955, when a few scientists in Israel started a parody magazine about science titled The Journal of Irreproducible Results. It contained a mix of jokes, satire of scientific practice, science cartoons, and discussion of funny but real research.
Fast forward 61 years to 2016. Nature put together a survey asking about reproducibility in research with 1500+ respondents. What they found was that more than 70% of researchers had tried and failed to reproduce another scientist’s experiments, and more than half admitted to failing to reproduce their own experiments. Also, 52% of the respondents said they believed there was a “reproducibility crisis” in science.
This was a project out of Duke University. The authors published a paper in Nature stating that they could predict whether a patient would respond to drug treatments based on which genes were being expressed in the patient. This obviously got a lot of people very excited.
Including these two statisticians (Keith Baggerly (left), Kevin Coombes (right)) from MD Anderson Cancer Center in Houston, TX. Picture from New York Times.
Short version of the story:
Three years later, in 2009, they published their analysis in The Annals of Applied Statistics, a journal that medical scientists rarely read.
It’s a beautiful and yet very sobering paper that goes through five different case studies, not just the one being discussed here, showing how simple errors in a data analysis or in the experimental design can potentially put patients at risk.
Meanwhile, many lawsuits and injunctions were filed to try to stop these clinical trials from going forward.
Duke finally started an internal review of the accusations.
In 2012, The Cancer Letter reported that Anil Potti had falsified parts of his resume, claiming to be a Rhodes scholar when he was not. The New York Times wrote several articles on it. Keith and Kevin were interviewed on 60 Minutes.
In the end, Duke settled with families of eight cancer patients who participated in clinical trials. Four papers were retracted. Duke shut down three trials using the results. Potti resigned from Duke. You can read about how all the events unfolded here.
You can watch a video from Keith explaining the entire thing. It’s definitely worth a watch!
A lot of these can be framed around the idea of a lack of statistical training or a misunderstanding of statistics and some form of experimental design or data analysis. But several are about making your analyses, methods, code, and data available.
A wise man once said:
“Your closest collaborator is you six months ago, but you don’t reply to emails.” -Karl Broman
Organizing your code in a way that lets you easily figure out what you did six months ago, and making your analysis reproducible, are two ways to avoid these problems.
Even if you have never used a version control tool, you’ve probably already done it manually: copying and renaming project folders (“paper-v1.doc”, “paper-v2.doc”, “paper-final.doc”, “paper_finalFINALdraft.doc”, etc.) is a form of version control.
Consider the following folder structure:
What is version control?
There are several options for version control:
We will use git and GitHub in this course.
Git is a tool that automates and enhances a lot of the tasks that arise when dealing with larger, longer-living, and collaborative projects. It has also become the common underpinning of many popular online code repositories, GitHub being the most popular. GitHub is an online service that permits you to organize and share your code in what are called repositories.
If you ask 10 people, you’ll get 10 different answers, but one of the commonalities is that most people do not realize how integral it is to their development process until they have started using it. Still, for the sake of argument, here are some highlights:
There are many great tutorials on helping you install and set up git and GitHub. Here are a few:
We will be using the command line version of git. If you are unfamiliar with working on the command line, try reading through the Command line interface, which is from the Data Science Specialization course on Coursera. There are many other fantastic tutorials for working on the command line.
git and GitHub workflow
The first thing to understand about git is that the contents of your project are stored in several different states and forms at any given time. If you think about what version control is, this might not be surprising: in order to remember every change that has ever been made, you need to store a record of those changes somewhere, and to be able to handle multiple people changing the same code, you need to have different copies of the project and a way to combine them.
You can think about git operating on four different areas:
You will move your code between these different areas using the following workflow:
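Files are moved from the working directory to the index via the command git add.
The local repository is stored in a .git directory in the working directory of your project. Files are moved from the index to the local repository via the command git commit.
Files are moved from the local repository to the remote repository via the command git push, and in the other direction using git fetch.
You can think of most git operations as moving code (or other metadata) between the local and remote repositories.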
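The movement between areas described above can be sketched end to end. This is a minimal sketch, not the course's prescribed setup: a bare repository on disk stands in for GitHub so that it runs offline, and the file name, identity, and commit message are hypothetical placeholders.

```shell
set -eu

# Hypothetical scratch locations; a bare repository stands in for GitHub.
work=$(mktemp -d)
remote="$work/remote.git"
project="$work/project"

git init --quiet --bare "$remote"
git init --quiet "$project"
cd "$project"

# Identity for this scratch repository only (placeholder values).
git config user.name "Example Student"
git config user.email "student@example.com"

# Working directory: create or edit a file.
echo "x <- rnorm(100)" > analysis.R

# Working directory -> index.
git add analysis.R

# Index -> local repository (stored in project/.git).
git commit --quiet -m "Add simulation script"

# Local repository -> remote repository.
git remote add origin "$remote"
git push --quiet origin HEAD

# Remote repository -> local repository (downloads only, does not merge).
git fetch --quiet origin

# Show the history now recorded in the local repository.
git log --oneline
```

With a real GitHub repository, the only change would be the address passed to git remote add.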
Git is more effective when used at a fine granularity. For starters, you can’t undo what you haven’t committed, so committing lots of small changes makes it easier to find the right rollback point. Also, merging becomes a lot easier when you only have to deal with a handful of conflicts.
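As a rough illustration of why small commits make rollback easier, the sketch below records three small changes and then undoes only the most recent one. The file name and messages are made up for the example, and git revert is just one of several ways to roll back.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# Three small, separate commits instead of one large one.
echo "step 1: load data" > pipeline.txt
git add pipeline.txt
git commit --quiet -m "Add data loading step"

echo "step 2: filter outliers" >> pipeline.txt
git commit --quiet -am "Add outlier filter"

echo "step 3: fit model" >> pipeline.txt
git commit --quiet -am "Add model fit"

# Undo only the last change by recording its inverse as a new commit.
git revert --no-edit HEAD

# The file is back to two steps, and the history keeps all four commits.
cat pipeline.txt
git log --oneline
```

Because each commit touched one thing, the rollback point was easy to name. Had all three steps gone into a single commit, undoing the model fit would have meant editing the file by hand.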
Git is meant for tracking changes. In nearly all cases, the only meaningful difference between the contents of two binaries is that they are different. If you change source files, compile, and commit the resulting binary, git sees an entirely different file. The end result is that the git repository (which contains a complete history, remember) begins to become bloated with the history of many dissimilar binaries. Worse, there’s often little advantage to keeping those files in the history. An argument can be made for periodically snapshotting working binaries, but things like object files, compiled files, and editor auto-saves are basically wasted space.
Git comes with a built-in mechanism for ignoring certain types of files. Placing filenames or wildcards in a .gitignore file placed in the top-level directory (where the .git directory is also located) will cause git to ignore those files when checking file status. This is a good way to ensure you don’t commit the wrong files accidentally, and it also makes the output of git status somewhat cleaner.
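A small sketch of this mechanism follows. The patterns and file names are hypothetical examples of the kinds of files discussed above (object files, editor backups, build output), not a recommended list for any particular project.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet

# Hypothetical ignore patterns, placed in the top-level directory.
cat > .gitignore <<'EOF'
# compiled object files
*.o
# editor backup files
*~
# build output directory
build/
EOF

# A mix of files we want to track and files we do not.
touch main.c main.o notes.txt~
mkdir build
touch build/app

# Ignored files no longer appear in the status output.
git status --short

# git check-ignore exits with status 0 when a path is ignored.
git check-ignore -q main.o && echo "main.o is ignored"
```

Here git status lists only .gitignore and main.c as untracked. The object file, the editor backup, and the build directory are left out.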
I cannot overstate the importance of this.
Commit messages are a way of quickly telling your future self (and your collaborators) what the commit was about. For even a moderately sized project, digging through tens or hundreds of commits to find the change that you are looking for is a nightmare without friendly summaries.
By convention, commit messages start with a single-line summary, then an empty line, then a more comprehensive description of the changes.
This is an okay commit message. The changes are small, and the summary is sufficient to describe what happened.
This is better. The summary captures the important information (major shift, direct vs. helper), and the full commit message describes what the high-level changes were.
This. Don’t do this.
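To make the convention concrete (a single-line summary, an empty line, then a fuller description), here is one hypothetical commit written that way. The file, the change, and the wording are invented for illustration. Reading the message from standard input with git commit -F - is only one way to enter a multi-line message.

```shell
set -eu

repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name "Example Student"      # placeholder identity
git config user.email "student@example.com"

# A made-up change to commit.
echo "threshold <- median(x) + 3 * mad(x)" > filter.R
git add filter.R

# Summary line, blank line, then a longer description.
git commit --quiet -F - <<'EOF'
Switch outlier filter to a median-based rule

The mean-based threshold was sensitive to the extreme values it was
meant to remove. Use the median and MAD instead so that a few large
observations no longer shift the cutoff.
EOF

# %s prints only the summary line; %b prints the body.
git log -1 --format=%s
git log -1 --format=%b
```

Tools that list history compactly, such as git log --oneline, show only the summary line, which is why it needs to stand on its own.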
git routine
Now that you understand the basics, try doing the following on your own GitHub account:
Also, you can check out GitHub Bootcamp for more help.
jhu-advdatasci/2018 repository on GitHub
The lectures and homework assignments for this course are housed in the jhu-advdatasci/2018 repository on GitHub. Until today, if you were unfamiliar with using git and GitHub, you have probably been downloading the R Markdown files for each lecture manually.
After class today:
Use git clone to clone the 2018 remote repository to your own computer. This step only needs to be completed once.
Update your local copy with git pull. This last step will be repeated throughout the semester.
Read through this to learn how to use GitHub Classroom to get your homework assignments.
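The clone-once, pull-repeatedly routine can be sketched as follows. To keep the sketch runnable offline, a small local repository named upstream stands in for the course repository. The lecture file names are invented, and the GitHub address in the comment is an assumption based on the repository name given above.

```shell
set -eu

scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for the course repository so the sketch runs without a network.
# With network access the clone step would instead look something like:
#   git clone https://github.com/jhu-advdatasci/2018.git   (assumed address)
git init --quiet upstream
git -C upstream config user.name "Instructor"          # placeholder identity
git -C upstream config user.email "instructor@example.com"
echo "Lecture 1" > upstream/lecture01.Rmd
git -C upstream add lecture01.Rmd
git -C upstream commit --quiet -m "Add lecture 1"

# One-time step: clone the remote repository to your own computer.
git clone --quiet upstream 2018

# Later, new material is published in the remote repository.
echo "Lecture 2" > upstream/lecture02.Rmd
git -C upstream add lecture02.Rmd
git -C upstream commit --quiet -m "Add lecture 2"

# Repeated step: pull the new material into your local copy.
git -C 2018 pull --quiet

# The local copy now contains both lectures.
ls 2018
```

Note that git pull both fetches the new commits and merges them into your working copy, whereas git fetch only downloads them.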